Data Visualization with Python

week 0
tutorials
Author

Sihui Lin

Published

January 22, 2025

25-COM SCI-M148 Project 1

Name: Sihui Lin

UID: 106013406

Submission Guidelines (Due: Jan 29 before the class)

  1. Please fill in your name and UID above.

  2. Please submit a PDF printout of your Jupyter Notebook to Gradescope. If you have any trouble accessing Gradescope, please let a TA know ASAP.

  3. When submitting to Gradescope, you will be taken to a page that asks you to assign questions and pages. As the PDF can get long, please make sure to assign pages to corresponding questions to ensure the readers know where to look.

Introduction

Welcome to CS148 - Introduction to Data Science! As we're planning to move through topics aggressively in this course, we'll start with an end-to-end walkthrough of a data science project, and then ask you to replicate the code yourself for a new dataset.

Please note: We don’t expect you to fully grasp everything happening here in either code or theory. This content will be reviewed throughout the quarter. Rather we hope that by giving you the full perspective on a data science project it will better help to contextualize the pieces as they’re covered in class

In that spirit, we will first work through an example project from end to end to give you a feel for the steps involved.

Here are the main steps:

  1. Get the data
  2. Visualize the data for insights
  3. Preprocess the data for your machine learning algorithm
  4. Select a machine learning model and train it
  5. Evaluate its performance

Working with Real Data

It is best to experiment with real data as opposed to artificial datasets.

There are many different open datasets depending on the type of problems you might be interested in!

Here are a few data repositories you could check out: - UCI Datasets - Kaggle Datasets - AWS Datasets

Below we will run through a California Housing example, using data collected in the 1990s.

Setup

We’ll start by importing a series of libraries we’ll be using throughout the project.

import sys
assert sys.version_info >= (3, 5) # python >= 3.5
import sklearn
#assert sklearn.__version__ >= "0.20" # sklearn >= 0.20

import numpy as np #numerical package in python

#matplotlib magic for inline figures
%matplotlib inline
import matplotlib # plotting library
import matplotlib.pyplot as plt #plotting package

# to make this notebook's output identical at every run
np.random.seed(42)

Intro to Data Exploration Using Pandas

In this section we will load the dataset, and visualize different features using different types of plots.

Packages we will use: - Pandas: a fast, flexible, and expressive data structure library widely used for tabular and multidimensional datasets. - Matplotlib: a 2D Python plotting library which you can use to create quality figures (you can plot almost anything if you're willing to code it out!) - other plotting libraries: seaborn, ggplot2

Note: If you’re working in CoLab for this project, the CSV file first has to be loaded into the environment. This can be done manually using the sidebar menu option, or using the following code here.

If you’re running this notebook locally on your device, simply proceed to the next step.

# from google.colab import files
# files.upload()

We’ll now begin working with Pandas. Pandas is the principle library for data management in python. It’s primary mechanism of data storage is the dataframe, a two dimensional table, where each column represents a datatype, and each row a specific data element in the set.

To work with dataframes, we have to first read in the csv file and convert it to a dataframe using the code below.

# We'll now import the holy grail of python datascience: Pandas!
import pandas as pd
housing = pd.read_csv('housing.csv')
housing.head() # show the first few elements of the dataframe
               # typically this is the first thing you do
               # to see how the dataframe looks like
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY

A dataset may have different types of features: - real valued - discrete (integers) - categorical (strings) - Boolean

Discrete and categorical features are essentially interchangeable, as you can always map a categorical string/character to an integer.

In our dataset example, all features are real-valued floats, except ocean_proximity, which is categorical.
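To make the string-to-integer mapping concrete, here's a minimal sketch (on a toy series of a few ocean_proximity-style values, not the full dataset) of how pandas can assign integer codes to categories:

```python
import pandas as pd

# Toy example: map categorical strings to integer codes.
# pd.factorize assigns codes in order of first appearance.
s = pd.Series(["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"])
codes, uniques = pd.factorize(s)
print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['NEAR BAY', 'INLAND', 'ISLAND']
```

Note that such a mapping is arbitrary; for models sensitive to ordering, one-hot encoding is usually preferred.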

# to see a concise summary of data types, null values, and counts
# use the info() method on the dataframe
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
# you can access individual columns similarly
# to accessing elements in a python dict
housing["ocean_proximity"].head() # added head() to avoid printing many columns..
0    NEAR BAY
1    NEAR BAY
2    NEAR BAY
3    NEAR BAY
4    NEAR BAY
Name: ocean_proximity, dtype: object
# to access a particular row we can use iloc
housing.iloc[1]
longitude              -122.22
latitude                 37.86
housing_median_age        21.0
total_rooms             7099.0
total_bedrooms          1106.0
population              2401.0
households              1138.0
median_income           8.3014
median_house_value    358500.0
ocean_proximity       NEAR BAY
Name: 1, dtype: object
# one other function that might be useful is
# value_counts(), which counts the number of occurrences
# for categorical features
housing["ocean_proximity"].value_counts()
ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64
# The describe function compiles your typical statistics for each
# column
housing.describe()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

If you want to learn about different ways of accessing elements or other functions it’s useful to check out the getting started section here

Let’s start visualizing the dataset

# We can draw a histogram for each of the dataframes features
# using the hist function
housing.hist(bins=50, figsize=(20,15))
# save_fig("attribute_histogram_plots")
plt.show() # pandas internally uses matplotlib, and to display all the figures
           # the show() function must be called

# if you want to have a histogram on an individual feature:
housing["median_income"].hist()
plt.show()

We can convert a floating point feature to a categorical feature by binning or by defining a set of intervals.

For example, to bin the households based on median_income we can use the pd.cut function

# assign each bin a categorical value [1, 2, 3, 4, 5] in this case.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

housing["income_cat"].value_counts()
income_cat
3    7236
2    6581
4    3639
5    2362
1     822
Name: count, dtype: int64
housing["income_cat"].hist()

Next let’s visualize the household incomes based on latitude & longitude coordinates

## here's a not-so-interesting way of plotting it
housing.plot(kind="scatter", x="longitude", y="latitude")

# we can make it look a bit nicer by using the alpha parameter,
# it simply plots less dense areas lighter.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

# A more interesting plot is to color code (heatmap) the dots
# based on income. The code below achieves this

# Please note: In order for this to work, ensure that you've loaded an image
# of california (california.png) into this directory prior to running this

import matplotlib.image as mpimg
california_img=mpimg.imread('california.png')
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=housing['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=False, alpha=0.4,
                      )
# overlay the california map on the plotted scatter plot
# note: plt.imshow still refers to the most recent figure
# that hasn't been plotted yet.
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

# setting up heatmap colors based on median_house_value feature
prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cb = plt.colorbar()
cb.set_ticks(tick_values) # fix the tick positions before relabeling them
cb.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cb.set_label('Median House Value', fontsize=16)

plt.legend(fontsize=16)
plt.show()

Not surprisingly, the most expensive houses are concentrated around the San Francisco and Los Angeles areas.

Up until now we have only visualized feature histograms and basic statistics.

When developing machine learning models, the predictiveness of a feature for a particular target of interest is what's important.

It may be that only a few features are useful for the target at hand, or features may need to be augmented by applying certain transformations.

Nonetheless, we can explore this using correlation matrices.

# Select only numeric columns
numeric_housing = housing.select_dtypes(include=[float, int])

# Compute the correlation matrix
corr_matrix = numeric_housing.corr()
# for example if the target is "median_house_value", most correlated features can be sorted
# which happens to be "median_income". This also intuitively makes sense.
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
# the correlation matrix for different attributes/features can also be plotted
# some features may show a positive correlation/negative correlation or
# it may turn out to be completely random!
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

# median income vs median house value: plot 2 in the first row of the figure above
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])

Preparing the Dataset for ML

Dealing With Incomplete Data

# have you noticed when looking at the dataframe summary certain rows
# contained null values? we can't just leave them as nulls and expect our
# model to handle them for us...
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity income_cat
290 -122.16 37.77 47.0 1256.0 NaN 570.0 218.0 4.3750 161900.0 NEAR BAY 3
341 -122.17 37.75 38.0 992.0 NaN 732.0 259.0 1.6196 85100.0 NEAR BAY 2
538 -122.28 37.78 29.0 5154.0 NaN 3741.0 1273.0 2.5762 173400.0 NEAR BAY 2
563 -122.24 37.75 45.0 891.0 NaN 384.0 146.0 4.9489 247100.0 NEAR BAY 4
696 -122.10 37.69 41.0 746.0 NaN 387.0 161.0 3.9063 178400.0 NEAR BAY 3
sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # option 1: simply drop rows that have null values
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity income_cat
sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2: drop the complete feature
longitude latitude housing_median_age total_rooms population households median_income median_house_value ocean_proximity income_cat
290 -122.16 37.77 47.0 1256.0 570.0 218.0 4.3750 161900.0 NEAR BAY 3
341 -122.17 37.75 38.0 992.0 732.0 259.0 1.6196 85100.0 NEAR BAY 2
538 -122.28 37.78 29.0 5154.0 3741.0 1273.0 2.5762 173400.0 NEAR BAY 2
563 -122.24 37.75 45.0 891.0 384.0 146.0 4.9489 247100.0 NEAR BAY 4
696 -122.10 37.69 41.0 746.0 387.0 161.0 3.9063 178400.0 NEAR BAY 3
median = housing["total_bedrooms"].median()
# option 3: replace na values with the median value
# (assigning the result back avoids pandas' chained-assignment FutureWarning)
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)
sample_incomplete_rows
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity income_cat
290 -122.16 37.77 47.0 1256.0 435.0 570.0 218.0 4.3750 161900.0 NEAR BAY 3
341 -122.17 37.75 38.0 992.0 435.0 732.0 259.0 1.6196 85100.0 NEAR BAY 2
538 -122.28 37.78 29.0 5154.0 435.0 3741.0 1273.0 2.5762 173400.0 NEAR BAY 2
563 -122.24 37.75 45.0 891.0 435.0 384.0 146.0 4.9489 247100.0 NEAR BAY 4
696 -122.10 37.69 41.0 746.0 435.0 387.0 161.0 3.9063 178400.0 NEAR BAY 3

Now that we’ve played around with this, lets finalize this approach by replacing the nulls in our final dataset

housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)

Could you think of another plausible imputation for this dataset?
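One plausible alternative, sketched here on toy data (not part of the original notebook), is to impute per group rather than globally, e.g. filling each missing total_bedrooms with the median of its ocean_proximity group:

```python
import pandas as pd

# Toy frame with missing total_bedrooms values.
df = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY"],
    "total_bedrooms":  [100.0, None, 300.0, None],
})

# Fill each missing value with the median of its ocean_proximity group.
df["total_bedrooms"] = (
    df.groupby("ocean_proximity")["total_bedrooms"]
      .transform(lambda s: s.fillna(s.median()))
)
print(df["total_bedrooms"].tolist())  # [100.0, 100.0, 300.0, 300.0]
```

Group-wise imputation can be more faithful when the feature's distribution differs markedly across groups.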

Augmenting Features

New features can be created by combining different columns from our data set.

  • rooms_per_household = total_rooms / households
  • bedrooms_per_room = total_bedrooms / total_rooms
  • etc.
housing["rooms_per_household"] = housing["total_rooms"]/(housing["households"] + 1e-6)
housing["bedrooms_per_room"] = housing["total_bedrooms"]/(housing["total_rooms"] + 1e-6)
housing["population_per_household"]=housing["population"]/(housing["households"] + 1e-6)
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()

Dealing with Non-Numeric Data

So we’re almost ready to feed our dataset into a machine learning model, but we’re not quite there yet!

Generally speaking all models can only work with numeric data, which means that if you have Categorical data you want included in your model, you’ll need to do a numeric conversion. We’ll explore this more later, but for now we’ll take one approach to converting our ocean_proximity field into a numeric one.

from sklearn.preprocessing import LabelEncoder

# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
housing['ocean_proximity'] = labelencoder.fit_transform(housing['ocean_proximity'])
housing.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity income_cat rooms_per_household bedrooms_per_room population_per_household
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 3 5 6.984127 0.146591 2.555556
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 3 5 6.238137 0.155797 2.109842
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 3 5 8.288136 0.129516 2.802260
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 3 4 5.817352 0.184458 2.547945
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 3 3 6.281853 0.172096 2.181467
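One caveat with LabelEncoder is that it imposes an arbitrary ordering on the categories. A common alternative, shown here as a sketch on a toy series (not applied to the notebook's dataframe), is one-hot encoding with pd.get_dummies:

```python
import pandas as pd

# One binary column per category; no implied ordering between categories.
s = pd.Series(["NEAR BAY", "INLAND", "NEAR BAY"], name="ocean_proximity")
onehot = pd.get_dummies(s, prefix="ocean")
print(list(onehot.columns))                # ['ocean_INLAND', 'ocean_NEAR BAY']
print(onehot.astype(int).values.tolist())  # [[0, 1], [1, 0], [0, 1]]
```

We'll revisit the trade-offs between these encodings later in the course.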

Divide up the Dataset for Machine Learning

After having cleaned your dataset you’re ready to train your machine learning model.

To do so you’ll aim to divide your data into: - train set - test set

In some cases you might also have a validation set for tuning hyperparameters (don't worry if you're not familiar with this term yet).
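If you do want a validation set, one common recipe (a sketch using scikit-learn's train_test_split on stand-in arrays X and y, not the housing data) is simply to split twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a feature matrix X and target vector y.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# First carve off the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```

The fractions (0.2, then 0.25 of the remainder) yield a 60/20/20 split here; the right proportions depend on your dataset size.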

In a supervised learning setting, your train set and test set should contain (feature, target) tuples. - feature: the input to your model - target: the ground truth label - when the target is categorical, the task is classification - when the target is floating point, the task is regression

We will make use of the scikit-learn Python package for preprocessing.

Scikit-learn is well documented, and if you get confused at any point, simply look up the function/object!

from sklearn.model_selection import StratifiedShuffleSplit
# let's first start by creating our train and test sets
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    train_set = housing.loc[train_index]
    test_set = housing.loc[test_index]
housing_training = train_set.drop("median_house_value", axis=1) # drop labels from the training set features
                                                       # the input to the model should not contain the true label
housing_labels = train_set["median_house_value"].copy()
housing_testing = test_set.drop("median_house_value", axis=1) # likewise drop labels from the test set features
housing__test_labels = test_set["median_house_value"].copy()

Select a model and train

Once we have prepared the dataset it’s time to choose a model.

As our task is to predict the median_house_value (a floating value), regression is well suited for this.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_training, housing_labels)
LinearRegression()
# let's try our model on a few testing instances
data = housing_testing.iloc[:5]
labels = housing__test_labels.iloc[:5]

print("Predictions:", np.round(lin_reg.predict(data), 1))
print("Actual labels:", list(labels))
Predictions: [418197.2 305620.5 232253.  188754.6 251166.4]
Actual labels: [500001.0, 162500.0, 204600.0, 159700.0, 184000.0]

We can evaluate our model using certain metrics; a fitting metric for regression is the mean squared error

\[L(\hat{Y}, Y) = \frac{1}{N} \sum_i^N (\hat{y_i} - y_i)^2\]

where \(\hat{y}\) is the predicted value, and y is the ground truth label.
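As a sanity check of the formula, here is the same quantity computed directly with NumPy on a few illustrative numbers (not actual model outputs):

```python
import numpy as np

y_hat = np.array([3.0, 5.0, 2.0])  # predicted values
y     = np.array([2.0, 5.0, 4.0])  # ground truth labels

mse = np.mean((y_hat - y) ** 2)    # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                # RMSE is in the same units as the target
print(f"{mse:.4f} {rmse:.4f}")     # 1.6667 1.2910
```

Taking the square root (RMSE), as the cell below does, makes the error interpretable in dollars rather than squared dollars.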

from sklearn.metrics import mean_squared_error

preds = lin_reg.predict(housing_testing)
mse = mean_squared_error(housing__test_labels, preds)
rmse = np.sqrt(mse)
rmse
67694.08184344384

Is this a good result? What do you think an acceptable error rate is for this sort of problem?

TODO: Applying the end-to-end ML steps to a different dataset.

Ok now it’s time to get to work! We will apply what we’ve learnt to another dataset (airbnb dataset). For this project we will attempt to predict the airbnb rental price based on other features in our given dataset.

Visualizing Data

Load the data + statistics

Let’s do the following set of tasks to get us warmed up: - load the dataset - display the first few rows of the data - drop the following columns: name, host_id, host_name, last_review, neighbourhood - display a summary of the statistics of the loaded data

import pandas as pd
airbnb =  pd.read_csv('AB_NYC_2019.csv') # we load the pandas dataframe
airbnb_drop = airbnb.drop(columns = ["name", "host_id", "host_name", "last_review", "neighbourhood"])
airbnb_drop.describe()
id latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365
count 4.889500e+04 48895.000000 48895.000000 48895.000000 48895.000000 48895.000000 38843.000000 48895.000000 48895.000000
mean 1.901714e+07 40.728949 -73.952170 152.720687 7.029962 23.274466 1.373221 7.143982 112.781327
std 1.098311e+07 0.054530 0.046157 240.154170 20.510550 44.550582 1.680442 32.952519 131.622289
min 2.539000e+03 40.499790 -74.244420 0.000000 1.000000 0.000000 0.010000 1.000000 0.000000
25% 9.471945e+06 40.690100 -73.983070 69.000000 1.000000 1.000000 0.190000 1.000000 0.000000
50% 1.967728e+07 40.723070 -73.955680 106.000000 3.000000 5.000000 0.720000 1.000000 45.000000
75% 2.915218e+07 40.763115 -73.936275 175.000000 5.000000 24.000000 2.020000 2.000000 227.000000
max 3.648724e+07 40.913060 -73.712990 10000.000000 1250.000000 629.000000 58.500000 327.000000 365.000000
airbnb_drop.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   neighbourhood_group             48895 non-null  object 
 2   latitude                        48895 non-null  float64
 3   longitude                       48895 non-null  float64
 4   room_type                       48895 non-null  object 
 5   price                           48895 non-null  int64  
 6   minimum_nights                  48895 non-null  int64  
 7   number_of_reviews               48895 non-null  int64  
 8   reviews_per_month               38843 non-null  float64
 9   calculated_host_listings_count  48895 non-null  int64  
 10  availability_365                48895 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 4.1+ MB

Some Basic Visualizations

Let’s try another popular python graphics library: Plotly.

You can find documentation and all the examples you’ll need here: Plotly Documentation

Let’s start out by getting a better feel for the distribution of rentals in the market.

Generate a pie chart showing the distribution of room type (room_type in the dataset) across NYC's Manhattan borough (filtered by neighbourhood_group in the dataset)

import plotly.express as px
manhattan = airbnb_drop[airbnb_drop["neighbourhood_group"] == "Manhattan"]
fig = px.pie(manhattan,
             names = "room_type", 
             title = "Distribution of room type across NYC's Manhattan Boroughs")
fig.show()

Plot the total number_of_reviews per room_type

We now want to see the total number of reviews left for each room type group in the form of a bar chart (where the X-axis is the room type group and the Y-axis is a count of reviews).

This is a two-step process: 1. You'll have to sum up the reviews per room type group (hint: try using the groupby function) 2. Then use Plotly to generate the graph

room = airbnb_drop.groupby("room_type")["number_of_reviews"].sum().reset_index()
room.head()
room_type number_of_reviews
0 Entire home/apt 580403
1 Private room 538346
2 Shared room 19256
fig = px.bar(room,
            x = "room_type",
            y = "number_of_reviews",
            title = "Total Number of Reviews By Room Type")
fig.show()

Plot a map of Airbnbs throughout New York (if it gets too crowded, take a subset of the data, and try to make it look nice if you can :) ).

For reference you can use the Matplotlib code above to replicate this graph here.

airbnb.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

miniairbnb = airbnb_drop.sample(n=1000, random_state=1)
# A more interesting plot is to color code (heatmap) the dots
# based on price. The code below achieves this

# load an image of New York
import matplotlib.image as mpimg
nyc_img = mpimg.imread('nyc.png')

ax = miniairbnb.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                     s=miniairbnb['number_of_reviews'], label="Number of Reviews",
                     c="price", cmap=plt.get_cmap("jet"),
                     colorbar=False, alpha=0.4,
                     vmin=0, vmax = 500
                      )

# overlay the NYC map on the plotted scatter plot
# note: plt.imshow still refers to the most recent figure
# that hasn't been plotted yet.

# find the extent of the coordinates
min_longitude = airbnb["longitude"].min()
max_longitude = airbnb["longitude"].max()
min_latitude = airbnb["latitude"].min()
max_latitude = airbnb["latitude"].max()

plt.imshow(nyc_img, 
           extent=[min_longitude, max_longitude, min_latitude-0.01, max_latitude], 
           alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

# setting up heatmap colors based on price feature
prices = miniairbnb["price"]
tick_values = np.linspace(0, 500, 6)  # creates 6 ticks from 0 to 500
cb = plt.colorbar()
cb.set_ticks(tick_values) # fix the tick positions before relabeling them
cb.ax.set_yticklabels(["$%d"%(v) for v in tick_values], fontsize=14)
cb.set_label('Price', fontsize=16)
plt.legend(fontsize=16)
plt.title("Airbnb Locations in NYC by Price", fontsize=16)
plt.show()

Now try to recreate this plot using Plotly's scatter plot functionality. Note that the plot's increased interactivity allows for some very cool features.

fig = px.scatter(miniairbnb, x="longitude", y="latitude",
                 size="number_of_reviews", color = "price", range_color = [0, 500],
                title="Airbnb Locations in NYC by Price",
                range_x=[min_longitude, max_longitude],
                range_y=[min_latitude, max_latitude])

import base64
#set a local image as a background
image_filename = 'nyc.png'
plotly_logo = base64.b64encode(open(image_filename, 'rb').read())

# WRITE YOUR CODE HERE #
fig.add_layout_image(dict(
    source = 'data:image/png;base64,{}'.format(plotly_logo.decode()),
    xref="x", yref="y",
    x = min_longitude, 
    y = max_latitude-0.01,
    sizex = max_longitude - min_longitude,
    sizey = max_latitude - min_latitude,
    opacity=0.5,
    layer = "below"
))

fig.update_layout(
    width=800, height=800)

fig.show()

Use Plotly to plot the average price of room types in Brooklyn who have at least 10 Reviews.

Like with the previous example you’ll have to do a little bit of data engineering before you actually generate the plot.

Generally I’d recommend the following series of steps: 1. Filter the data by neighborhood group and number of reviews to arrive at the subset of data relevant to this graph. 2. Groupby the room type 3. Take the mean of the price for each roomtype group 4. FINALLY (seriously!?!?) plot the result

subgroup = airbnb_drop[(airbnb_drop["neighbourhood_group"] == "Brooklyn") & (airbnb_drop["number_of_reviews"] >= 10)]
subgroup = subgroup.groupby(by = "room_type")["price"].mean().reset_index()
# WRITE YOUR CODE HERE #
fig = px.bar(subgroup, x = "room_type", y = "price",
            title = "Average Price of Room Types in Brooklyn with 10+ Reviews")
fig.show()

Prepare the Data

airbnb_drop.head()
id neighbourhood_group latitude longitude room_type price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365
0 2539 Brooklyn 40.64749 -73.97237 Private room 149 1 9 0.21 6 365
1 2595 Manhattan 40.75362 -73.98377 Entire home/apt 225 1 45 0.38 2 355
2 3647 Manhattan 40.80902 -73.94190 Private room 150 3 0 NaN 1 365
3 3831 Brooklyn 40.68514 -73.95976 Entire home/apt 89 1 270 4.64 1 194
4 5022 Manhattan 40.79851 -73.94399 Entire home/apt 80 10 9 0.10 1 0

Feature Engineering

Let’s create a new binned feature, price_cat that will divide our dataset into quintiles (1-5) in terms of price level (you can choose the levels to assign)

Do a value count to check the distribution of values

# assign each bin a categorical value [1, 2, 3, 4, 5] in this case.
airbnb_drop["price_cat"] = pd.qcut(airbnb_drop["price"], q = 5,
                                   labels = [1, 2, 3, 4, 5])

airbnb_drop["price_cat"].value_counts()
price_cat
4    10809
1    10063
2     9835
3     9804
5     8384
Name: count, dtype: int64
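Note that the counts above are only roughly equal: `qcut` places its bin edges at the data’s quantiles, and repeated price values tied at an edge must all fall on the same side. If you instead wanted fixed dollar thresholds, `pd.cut` with explicit edges is the counterpart. A sketch on synthetic prices (the edges here are made up for illustration):

```python
import pandas as pd

prices = pd.Series(range(1, 101))  # 100 synthetic prices: 1..100

# quantile-based bins: equal-sized groups by construction (no ties here)
quintiles = pd.qcut(prices, q=5, labels=[1, 2, 3, 4, 5])

# fixed-edge bins: group sizes follow the price distribution instead
fixed = pd.cut(prices, bins=[0, 25, 50, 75, 90, 100], labels=[1, 2, 3, 4, 5])

print(quintiles.value_counts().sort_index())
print(fixed.value_counts().sort_index())
```

On this tie-free series every quintile holds exactly 20 values, while the fixed edges give uneven groups; on real price data `qcut` shows the small imbalances seen above.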

Data Imputation

Determine if there are any null values and impute them.

airbnb_drop.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   id                              48895 non-null  int64   
 1   neighbourhood_group             48895 non-null  object  
 2   latitude                        48895 non-null  float64 
 3   longitude                       48895 non-null  float64 
 4   room_type                       48895 non-null  object  
 5   price                           48895 non-null  int64   
 6   minimum_nights                  48895 non-null  int64   
 7   number_of_reviews               48895 non-null  int64   
 8   reviews_per_month               38843 non-null  float64 
 9   calculated_host_listings_count  48895 non-null  int64   
 10  availability_365                48895 non-null  int64   
 11  price_cat                       48895 non-null  category
dtypes: category(1), float64(3), int64(6), object(2)
memory usage: 4.2+ MB
# reviews_per_month appears to be null only for listings with no reviews;
# rather than imputing, we drop those rows here
airbnb_drop = airbnb_drop.dropna(subset=["reviews_per_month"]).reset_index(drop=True)
airbnb_drop.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38843 entries, 0 to 38842
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   id                              38843 non-null  int64   
 1   neighbourhood_group             38843 non-null  object  
 2   latitude                        38843 non-null  float64 
 3   longitude                       38843 non-null  float64 
 4   room_type                       38843 non-null  object  
 5   price                           38843 non-null  int64   
 6   minimum_nights                  38843 non-null  int64   
 7   number_of_reviews               38843 non-null  int64   
 8   reviews_per_month               38843 non-null  float64 
 9   calculated_host_listings_count  38843 non-null  int64   
 10  availability_365                38843 non-null  int64   
 11  price_cat                       38843 non-null  category
dtypes: category(1), float64(3), int64(6), object(2)
memory usage: 3.3+ MB
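Dropping is the simplest option, but it discards roughly 10,000 listings (about 20% of the data). Since a null reviews_per_month in this dataset seems to go hand in hand with number_of_reviews being 0, filling the nulls with 0 is a defensible imputation that keeps every row. A sketch on a synthetic frame:

```python
import pandas as pd

# synthetic frame mimicking the null pattern in the Airbnb data
df = pd.DataFrame({
    "number_of_reviews": [9, 45, 0, 270, 0],
    "reviews_per_month": [0.21, 0.38, None, 4.64, None],
})

# impute instead of dropping: no reviews implies zero reviews per month
df["reviews_per_month"] = df["reviews_per_month"].fillna(0)
print(df["reviews_per_month"].isna().sum())  # no nulls remain
```

Which choice is better depends on the model; either way the decision should be made before splitting into train and test sets, as is done here.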

Numeric Conversions

Finally, review what features in your dataset are non-numeric and convert them.

# convert room type and neighbourhood group
from sklearn.preprocessing import LabelEncoder

# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
airbnb_drop["room_type"] = labelencoder.fit_transform(airbnb_drop["room_type"])
airbnb_drop["neighbourhood_group"] = labelencoder.fit_transform(airbnb_drop["neighbourhood_group"])

# convert price_cat from category to int
airbnb_drop["price_cat"] = airbnb_drop["price_cat"].astype(int)

airbnb_drop.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38843 entries, 0 to 38842
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              38843 non-null  int64  
 1   neighbourhood_group             38843 non-null  int64  
 2   latitude                        38843 non-null  float64
 3   longitude                       38843 non-null  float64
 4   room_type                       38843 non-null  int64  
 5   price                           38843 non-null  int64  
 6   minimum_nights                  38843 non-null  int64  
 7   number_of_reviews               38843 non-null  int64  
 8   reviews_per_month               38843 non-null  float64
 9   calculated_host_listings_count  38843 non-null  int64  
 10  availability_365                38843 non-null  int64  
 11  price_cat                       38843 non-null  int64  
dtypes: float64(3), int64(9)
memory usage: 3.6 MB
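One caveat with LabelEncoder on input features: it imposes an arbitrary ordering (e.g. Bronx < Brooklyn < Manhattan) that a linear model will treat as meaningful. One-hot encoding avoids that by giving each category its own indicator column. A minimal sketch with `pd.get_dummies` (the category values come from the dataset; the frame itself is synthetic):

```python
import pandas as pd

df = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
})

# one indicator column per category, with no implied ordering
encoded = pd.get_dummies(df, columns=["neighbourhood_group", "room_type"])
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix, which is usually harmless at this dataset’s size.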

Prepare Data for Machine Learning

Using our StratifiedShuffleSplit function example from above, let’s split our data into an 80/20 training/testing split, using price_cat to stratify the partition.

from sklearn.model_selection import StratifiedShuffleSplit
# let's first start by creating our train and test sets

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(airbnb_drop, airbnb_drop["price_cat"]):
    train_set = airbnb_drop.loc[train_index]
    test_set = airbnb_drop.loc[test_index]
test_set.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7769 entries, 19003 to 4113
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              7769 non-null   int64  
 1   neighbourhood_group             7769 non-null   int64  
 2   latitude                        7769 non-null   float64
 3   longitude                       7769 non-null   float64
 4   room_type                       7769 non-null   int64  
 5   price                           7769 non-null   int64  
 6   minimum_nights                  7769 non-null   int64  
 7   number_of_reviews               7769 non-null   int64  
 8   reviews_per_month               7769 non-null   float64
 9   calculated_host_listings_count  7769 non-null   int64  
 10  availability_365                7769 non-null   int64  
 11  price_cat                       7769 non-null   int64  
dtypes: float64(3), int64(9)
memory usage: 789.0 KB
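For a single stratified split like this one, scikit-learn’s `train_test_split` with `stratify=` is a more compact equivalent (StratifiedShuffleSplit earns its keep when you need several resampled splits). A sketch on synthetic labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic frame with an imbalanced price_cat-style column (60/40)
df = pd.DataFrame({"x": range(100), "price_cat": [1] * 60 + [2] * 40})

train, test = train_test_split(
    df, test_size=0.2, stratify=df["price_cat"], random_state=42
)

# class proportions are preserved in both splits
print(test["price_cat"].value_counts(normalize=True).sort_index())
```

Here the 20-row test set keeps the 60/40 ratio exactly (12 and 8 rows), just as the StratifiedShuffleSplit split above preserves the price_cat quintile proportions.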

Finally, remove your labels price and price_cat from your testing and training cohorts, and create separate label features.

training = train_set.drop(["price", "price_cat"], axis = 1)
training_labels = train_set[["price", "price_cat"]].copy()

testing = test_set.drop(["price", "price_cat"], axis = 1)
testing_labels = test_set[["price", "price_cat"]].copy()
training.head()
|       | id | neighbourhood_group | latitude | longitude | room_type | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|-------|----|---------------------|----------|-----------|-----------|----------------|-------------------|-------------------|--------------------------------|------------------|
| 22913 | 21610946 | 1 | 40.67411 | -73.96532 | 1 | 1 | 55 | 2.76 | 1 | 63 |
| 20696 | 19913070 | 1 | 40.69723 | -73.93769 | 0 | 3 | 29 | 1.22 | 1 | 13 |
| 10021 | 9049582 | 2 | 40.77370 | -73.95616 | 2 | 1 | 17 | 0.38 | 1 | 0 |
| 38479 | 35681857 | 1 | 40.68884 | -73.97330 | 0 | 3 | 1 | 1.00 | 1 | 5 |
| 4308 | 3322922 | 1 | 40.65320 | -73.96216 | 0 | 18 | 11 | 0.20 | 1 | 0 |

Fit a linear regression model

The task is to predict the price; you can refer to the housing example for how to train and evaluate your model using MSE. Provide both test- and train-set MSE values.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
# note: training_labels has two columns, so this fits a multi-output
# regression predicting both price and price_cat at once
lin_reg.fit(training, training_labels)
LinearRegression()
# let's try our model on a few testing instances
data = testing.iloc[:5]
labels = testing_labels.iloc[:5]

print("Predictions:", np.round(lin_reg.predict(data), 1))
# note: list(labels) would only print the column names; use .values
# to see the actual [price, price_cat] rows
print("Actual labels:", labels.values.tolist())
Predictions: [[107.3   2.3]
 [ 66.2   1.8]
 [ 51.2   1.4]
 [199.4   3.9]
 [ 48.4   1.5]]
from sklearn.metrics import mean_squared_error

test_preds = lin_reg.predict(testing)
test_mse = mean_squared_error(testing_labels, test_preds)
test_mse
13457.26780207614
train_preds = lin_reg.predict(training)
train_mse = mean_squared_error(training_labels, train_preds)
train_mse
18285.773735586783
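Because the model was fit on two targets at once, each MSE above is the uniform average over price and price_cat, and the two have very different scales, so the price error dominates the number. Reporting a separate error per target, and taking square roots to get RMSE in the target’s own units, makes the figures easier to read. A sketch on small synthetic arrays (the values are made up):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# synthetic [price, price_cat] targets and predictions
y_true = np.array([[149.0, 3.0], [225.0, 4.0], [89.0, 2.0]])
y_pred = np.array([[140.0, 2.5], [210.0, 4.5], [100.0, 2.0]])

# one MSE per output column instead of the uniform average
per_target_mse = mean_squared_error(y_true, y_pred, multioutput="raw_values")
rmse = np.sqrt(per_target_mse)
print("per-target MSE:", per_target_mse)
print("per-target RMSE:", rmse)
```

Applied to the test set above, this would show how much of the ~13,457 figure comes from the price column alone, and the price RMSE would be interpretable as dollars per night.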